Predicting Ranked SCOP Domains by Mining Associations of Visual Contents in Distance Matrices
Authors
Abstract
Protein tertiary structures are known to correlate significantly with biological function. To organize structural information, the Structural Classification of Proteins (SCOP) database, which is manually constructed by human experts, groups similar protein folds in the same domain hierarchy. Although this manual approach is believed to be more reliable than traditional alignment methods for structural classification, it is labor intensive. In this paper, we build a non-parametric classifier that predicts possible SCOP domains for unknown protein structures. Using supervised learning, the algorithm first maps the tertiary structures of training proteins into two-dimensional distance matrices and then extracts signatures from the visual contents of those matrices. A knowledge discovery and data mining (KDD) process then discovers relevant patterns in the training signatures of each SCOP domain by mining association rules. Finally, the number of rules whose patterns match the signature of an unknown protein determines the predicted domains in ranked order. We select 7,702 protein chains from 150 domains of the SCOP 1.67 release as labelled data and evaluate with 10-fold cross-validation. Experimental results show a prediction accuracy of 91.27% for the top-ranked domain and 99.22% for the top 5 ranked domains. The average response time is 6.34 seconds, indicating reasonably high prediction accuracy and efficiency.
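The two ends of the pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the representation of a signature as a set of opaque labels, and the subset test used for rule matching are all assumptions made for illustration. The abstract does not specify how signatures are extracted from the matrices or how rules are mined, so those steps are left out.

```python
import math


def distance_matrix(ca_coords):
    """Pairwise Euclidean distances between C-alpha coordinates.

    ca_coords is a list of (x, y, z) tuples, one per residue.
    """
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def rank_domains(signature, rules_by_domain, top_k=5):
    """Rank domains by how many of their rule patterns match the signature.

    Assumption: a pattern "matches" when all of its items occur in the
    signature. rules_by_domain maps a domain label to a list of patterns,
    each pattern being a set of hypothetical signature items.
    """
    sig = set(signature)
    counts = {
        domain: sum(1 for pattern in rules if set(pattern) <= sig)
        for domain, rules in rules_by_domain.items()
    }
    # Highest match count first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [domain for domain, n in ranked[:top_k] if n > 0]
```

Calling `rank_domains(query_signature, mined_rules, top_k=5)` would return up to five candidate domains, which corresponds to the top 5 ranking evaluated in the abstract. Domains with no matching rule are omitted.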
Similar Resources
Distance-based identification of structure motifs in proteins using constrained frequent subgraph mining.
Structure motifs are amino acid packing patterns that occur frequently within a set of protein structures. We define a labeled graph representation of protein structure in which vertices correspond to amino acid residues and edges connect pairs of residues and are labeled by (1) the Euclidean distance between the C(alpha) atoms of the two residues and (2) a boolean indicating whether the two re...
Spectroscopic Based Quantitative Mapping of Contaminant Elements in Dumped Soils of a Copper Mine
The possibility of mapping the distribution of arsenic and chromium in a mining area was investigated using a combination of VNIR reflectance spectroscopy and geostatistical analysis. Fifty-five soil samples were gathered from a waste dump at the Sarcheshmeh copper mine, and VNIR reflectance spectra were measured in a laboratory. The Savitzky-Golay first derivative was used as the main pre-processing metho...
Domain Fishing: a first step in protein comparative modelling
To optimize the search for structural templates in protein comparative modelling, the query sequence is split into domains. The initial list of templates for each domain, extracted from PFAM plus PDB and SCOP, is then ranked according to sequence identity (%ID), coverage and resolution. If %ID is less than 30, secondary structure matching is used to filter out false templates. AVAI...
ProteinDBS: a real-time retrieval system for protein structure comparison
We have developed a web server (ProteinDBS) for the life science community to search for similar protein tertiary structures in real time. This system applies computer visualization techniques to extract the predominant visual patterns encoded in two-dimensional distance matrices generated from the three-dimensional coordinates of protein chains. When meaningful contents, represented in a multi...
SCOPPI: a structural classification of protein–protein interfaces
SCOPPI, the structural classification of protein-protein interfaces, is a comprehensive database that classifies and annotates domain interactions derived from all known protein structures. SCOPPI applies SCOP domain definitions and a distance criterion to determine inter-domain interfaces. Using a novel method based on multiple sequence and structural alignments of SCOP families, SCOPPI presen...
Journal title:
Volume, issue
Pages -
Publication date 2006